Network analysis based on big data in social media of Korean adolescents’ diet behaviors

Adolescents are increasingly interested in weight control; hence, proper health education is important for helping them control their weight properly. This study was designed to identify social media words that express adolescents’ diet behaviors, and to identify the associations between such words and their types. It applied text-mining techniques and semantic network analysis to big data on adolescents’ diet behaviors collected from the Internet. Text mining was used to extract meaningful information from unstructured text data, whereas semantic network analysis was used to understand the relationships between keywords. The top five keywords were “obesity,” “health,” “exercise,” “eat,” and “increase” in online news, and “exercise,” “eat,” “weight loss,” “obesity,” and “health” in blogs. In online news, “appearance” ranked notably higher in betweenness centrality than in the other centrality measures. The CONCOR analysis identified eight clusters each in online news and blogs. This study’s results will serve as a basis for weight management-related intervention strategies that reflect the perspectives of adolescents. They can also serve as basic data for providing correct information and establishing desirable weight control in the future.


Introduction
As adolescents are increasingly interested in weight control and diet behaviors, proper health education is important to help them control their weight properly. The obesity rate among Korean adolescents (12.1%) is increasing yearly, with 34.6% of adolescents attempting to lose weight and 23.9% having a distorted body image [1]. Likewise, the National Health and Nutrition Examination Survey data revealed that during 2007-2008, approximately 18.1% of 12- to 19-year-olds in the United States were obese, which increased to 21.2% during 2017-2018 [2]. Maintaining healthy diet behaviors is a challenge for adolescents. In recent years, diet education interventions have increasingly relied on computing and information technologies, especially mobile platforms and social media [3]. As adolescents display a high level of smartphone and social media usage, they are more likely to use these platforms for monitoring their health [4]. Korean adolescents' Internet usage time, excluding for learning purposes, was 112.2 and 189.6 min on weekdays and weekends, respectively [1]. Most adolescents already rely on smartphones to search for health information [4]. While Internet use for education and communications has potential advantages, there are growing concerns about problematic Internet use. As such, providing correct information is important considering the high rate of weight loss attempts among adolescents, and the large amount of time that they spend on smartphones. Additionally, a survey on adolescents' health education needs found that they strongly desired advice on weight control [5]. A meta-analysis of studies on improving adolescents' health habits revealed that greater beneficial effects on health behaviors can be guaranteed by providing adolescents with helpful information to motivate them [6].
Big data are not merely a voluminous quantity of data that can be collected, stored, and analyzed [7], or the technology for processing large amounts of data [8]; rather, their essence lies in the value that can be created from such data. The core of big data technology lies in its ability to provide valuable new information and services by analyzing information that pours in. Therefore, collecting and analyzing online information on diet behaviors, a topic that adolescents are most interested in, will provide useful basic information to adolescents, who spend long hours on social media, and help them to grow into healthy adults.
Network analysis is a useful method for deriving the characteristics of the network type, and explaining the features of topics of interest by relationship [9]. It can be used for analyzing users' thought patterns based on content posted on social media, using text-mining techniques, and is, therefore, useful for understanding the context of connections between networked content [10]. Furthermore, such analyses and visualizations have the advantage of facilitating a grasp of the knowledge structure of the phenomenon of interest, and showing the direction [11]. Therefore, by collecting, processing, and analyzing big data and categorizing their connectivity, the characteristics and structure of content related to adolescents' diet behaviors, the phenomenon of interest here, can be identified.
Previous studies that attempted big data-based network analysis on adolescents had considered their peer relationships, smoking and drinking experiences [12], and peer networks according to their physical factors [9], and used semantic network analysis for assessing the knowledge structure of students with severe and multiple disabilities [13]. Another study on physical activity and exercise in school-aged youth aimed to provide a solution by analyzing a large number of scientific articles using text mining [14]. A study has also been conducted to analyze Korean adolescents' perceptions of sports and physical activities through big data analysis over the last 10 years, and provide research data and statistical direction with regard to their participation in such activities [15]. Under the premise that social media plays an important role in young people's daily lives, a study describing a big data approach to social media has been presented. The study exemplified this approach by analyzing an ad hoc dataset from the pro-eating disorder forum of a social media website [16]. During a review of previous studies, it was difficult to find a study that had used network analysis based on big data in social media to explore the diet behaviors of adolescents, despite the increasing number of studies using big data-based network analysis in various academic fields [17].
Therefore, this study was designed to provide basic data for establishing strategies to prevent adolescent obesity, which is increasing yearly, and establish desirable weight control, using social media for big data-based network analysis of Korean adolescents' diet behaviors. Hence, its purpose was to identify social media words that expressed adolescents' diet behaviors, and identify the associations between such words and their types.

Materials and methods
The diet behaviors of adolescents were analyzed using text-mining techniques and semantic network analysis for related big data collected from the Internet. Text mining is the process of extracting meaningful information from unstructured text data to explore key topics and trends from multiple perspectives. Semantic network analysis is used to understand the relationships between keywords. In this study, the following analysis process was established to understand the meaning of words based on their associations related to adolescents' diets in online news articles and blogs. The overall analysis process is shown in Fig 1.

Data collection
We collected data on adolescents' diet from online news and blogs in Naver [18] and Daum [19], which are the two largest portals in Korea. Using the search keyword "adolescents' diet," we collected 1,423 online news articles from Naver News and 1,733 blog posts from the Naver and Daum blogs.
Online news article texts were collected only from Naver because almost all Korean news articles can be found in Naver, and articles were inevitably duplicated when both Naver and Daum were searched. Many articles were also duplicated within Naver because various news media provide the same articles. Hence, we removed duplicate articles using cosine similarity, which measures the degree of similarity between two vectors as the cosine of the angle between them. As Naver and Daum blogs are rarely duplicated, blog posts were collected from both sites. Naver and Daum hold 74.1% and 18.7% shares of Korean blog sites, respectively [20].
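The duplicate-removal step can be illustrated with a minimal sketch. The paper does not report how articles were vectorized or which similarity cut-off was used, so the whitespace bag-of-words vectors, the 0.9 threshold, and the function names below are assumptions made purely for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def drop_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text already kept.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    kept, kept_vectors = [], []
    for text in texts:
        vec = Counter(text.lower().split())
        if all(cosine_similarity(vec, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vec)
    return kept

# Toy English stand-ins for syndicated news articles.
articles = [
    "teen obesity rate rises as exercise declines",
    "teen obesity rate rises as exercise declines sharply",
    "appetite suppressant side effects worry doctors",
]
unique_articles = drop_near_duplicates(articles)
print(len(unique_articles))
```

In this toy input the second article differs from the first by one word, so it exceeds the assumed threshold and is dropped, while the unrelated third article is kept.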
The data were collected using a web crawling program implemented in Python. We overcame the websites' anti-crawling measures using the Selenium library, which automates web browsers. The crawled data were parsed using the BeautifulSoup library, and saved in the DataFrame format of the Pandas library.

Data extraction and preprocessing
Data preprocessing was performed using KoNLPy, an open source Python library for natural language processing of Korean [21]. Through morphological analysis using KoNLPy, the collected data were refined to nouns, verbs, and adjectives, excluding special characters and symbols. After extracting the word list, Term Frequency-Inverse Document Frequency (TF-IDF) was calculated from a morpheme of one or more words.
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is calculated by multiplying two metrics: the term frequency of the word in the document and the inverse document frequency of the word across the set of documents. This weight is mainly used to measure the similarity between documents, the importance of search results, and the importance of specific words within a document.
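As a hedged illustration of this weighting, the sketch below multiplies a raw term count by log(N / df), where N is the number of documents and df is the number of documents containing the term. TF-IDF has several common variants and the paper does not state which one was used, so this particular formula, the function name, and the toy English tokens are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy tokenized documents (hypothetical, for illustration only).
docs = [
    ["diet", "exercise", "exercise"],
    ["diet", "obesity"],
    ["diet", "health"],
]
weights = tf_idf(docs)
print(weights[0])
```

Under this variant, a word that occurs in every document ("diet" here) receives a weight of zero, which is why words common to all documents carry little information for distinguishing them.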
Not every word in the dataset was used as a node in the co-occurrence matrix; based on the word-frequency lists, words whose frequencies were below certain cut-off values were excluded. In addition, the words that commonly appeared across all datasets were also ruled out because they are less meaningful in detecting differences in the semantic networks derived from distinct datasets [22,23].
For keyword selection, it is desirable to select the most appropriate word for the research topic, while referring to the opinions of experts [24]. Therefore, in this study, the top 50 words were selected based on their TF-IDF values, which reflected the opinions of a high school counselor, public health teacher, and network analysis expert. When selecting words, unrelated words, such as "person" and "society," were excluded, and words similar in meaning were incorporated. For example, all the frequencies of "fat," "overweight," and "gain weight," which were similar to that of "obesity," were added to the frequency of "obesity." Based on these 50 selected words, a Document-Term Matrix (DTM) was generated to represent the frequency of each word appearing in multiple articles and blogs. A DTM is meaningful in that it can quantify the relationship between words and documents. Subsequently, a Co-Occurrence Matrix (COM) was generated to determine the frequency of simultaneous appearances of words in the entire document.
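The DTM and COM construction can be sketched as follows. The paper does not specify exactly how co-occurrence was counted, so this example assumes that two keywords co-occur once for every document in which both appear; the keyword list and documents are toy values.

```python
from collections import Counter

# Hypothetical selected keywords and tokenized documents.
keywords = ["obesity", "exercise", "health"]
docs = [
    ["obesity", "exercise", "exercise", "health"],
    ["obesity", "health"],
    ["exercise"],
]

# Document-term matrix: one row per document, one column per keyword,
# each cell holding the keyword's frequency in that document.
dtm = []
for tokens in docs:
    counts = Counter(tokens)
    dtm.append([counts[k] for k in keywords])

# Co-occurrence matrix (assumed definition): number of documents
# that contain both keyword i and keyword j; the diagonal is left at 0.
n = len(keywords)
com = [[0] * n for _ in range(n)]
for row in dtm:
    for i in range(n):
        for j in range(n):
            if i != j and row[i] > 0 and row[j] > 0:
                com[i][j] += 1

print(dtm)
print(com)
```

The resulting COM is symmetric, which matches the undirected network used in the later analysis.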
Because the generated COM is complex to analyze, it was transformed into a binary matrix using the median of all its elements as the cut-off value: values higher than the median were changed to 1, and values lower than the median were changed to 0. This step sparsified an excessively dense network by reducing its values to 1 and 0. We used the binary matrix as the keyword COM in semantic network analysis. A network represented by the keyword COM is an unweighted and undirected network.
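A minimal sketch of the median cut-off is shown below. The text does not say how values exactly equal to the median were handled; here they are mapped to 0, which is an assumption, as are the function name and the toy matrix.

```python
from statistics import median

def binarize_by_median(matrix):
    """Map values above the median of all elements to 1, the rest to 0."""
    cutoff = median(v for row in matrix for v in row)
    # Assumption: values equal to the median become 0.
    return [[1 if v > cutoff else 0 for v in row] for row in matrix]

# Toy symmetric co-occurrence matrix (hypothetical values).
com = [
    [0, 5, 1],
    [5, 0, 3],
    [1, 3, 0],
]
binary = binarize_by_median(com)
print(binary)
```

The median of the nine elements is 1, so only the entries 5 and 3 survive as edges in the resulting unweighted, undirected network.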

Semantic network analysis and visualization
Semantic network analysis was used to understand the relationship between refined words related to adolescents' diet. It is a mixed method of social network analysis that identifies the structural characteristics of social phenomena, and uses data mining techniques for analyzing unstructured big data [22]. To intuitively recognize the co-occurrence relationship among the refined words in the social media data, the COM that was created in the previous section was visualized using NetDraw, a network visualization program [25].
To identify the connection structure of words related to adolescents' diet, NetworkX, a Python package [26], was used to analyze the following network centralities: 1) degree centrality-the number of nodes a particular node is connected to; 2) betweenness centrality-a measure of the mediation role of a node in a network; 3) closeness centrality-the inverse of the mean distance to all other nodes, which indicates how close a node is to all other nodes; and 4) eigenvector centrality-a measure of the influence of a node in a network [27].
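The four centralities can be computed with NetworkX roughly as sketched below. The toy graph and its edges are hypothetical and do not reproduce the study's network; the calls shown are standard NetworkX functions, but the exact parameters the authors used are not reported.

```python
import networkx as nx

# Toy unweighted, undirected keyword network (hypothetical edges).
G = nx.Graph()
G.add_edges_from([
    ("diet", "exercise"),
    ("diet", "health"),
    ("exercise", "health"),
    ("health", "appearance"),
])

degree = nx.degree_centrality(G)            # share of other nodes a node touches
betweenness = nx.betweenness_centrality(G)  # share of shortest paths through a node
closeness = nx.closeness_centrality(G)      # inverse of mean distance to other nodes
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)  # influence via neighbors

for name in G.nodes:
    print(name, round(degree[name], 3), round(betweenness[name], 3),
          round(closeness[name], 3), round(eigenvector[name], 3))
```

In this toy graph "health" is the only node on the shortest paths between "appearance" and the other keywords, so it has the highest betweenness centrality, which is the sense in which a keyword acts as a bridge.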
A CONvergence analysis of an iterative CORrelation (CONCOR) was performed to identify mutually exclusive subgroups in the semantic network. CONCOR repeatedly partitions nodes into subsets based on structural equivalence, and analyzes Pearson's correlations to search for groups with certain levels of similarity. It forms clusters, including nodes with similarities to each other [28]. This method is generally used to identify the relationship between simultaneous nodes of keywords across all possible keywords, by finding clusters of similar keywords [29]. We used UCINET 6.0 [30] to perform the CONCOR analysis, and the results were visualized using NetDraw.
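The core idea of CONCOR can be sketched in a few lines: correlate the columns of the matrix, correlate the columns of the resulting correlation matrix, and repeat until the entries approach +1 or -1, then split the nodes by sign. The study used UCINET for this step, so the pure-Python code below is only an illustrative stand-in; it performs a single bipartition (CONCOR is normally applied repeatedly to each resulting block), and the labels, matrix values, and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither sequence is constant

def correlation_matrix(m):
    """Correlate every pair of columns of m."""
    cols = list(zip(*m))
    return [[pearson(ci, cj) for cj in cols] for ci in cols]

def concor_split(matrix, max_iter=100, tol=1e-6):
    """One CONCOR bipartition: iterate correlations until entries reach +/-1."""
    corr = correlation_matrix(matrix)
    for _ in range(max_iter):
        if all(abs(abs(v) - 1.0) < tol for row in corr for v in row):
            break
        corr = correlation_matrix(corr)
    # Nodes positively correlated with the first node form one block.
    first = [i for i, v in enumerate(corr[0]) if v > 0]
    second = [i for i, v in enumerate(corr[0]) if v <= 0]
    return first, second

# Toy symmetric co-occurrence counts (hypothetical; diagonal included).
labels = ["calorie", "menu", "treatment", "prescription"]
com = [
    [3, 2, 0, 1],
    [2, 3, 1, 0],
    [0, 1, 3, 2],
    [1, 0, 2, 3],
]
group_a, group_b = concor_split(com)
print([labels[i] for i in group_a], [labels[i] for i in group_b])
```

Keywords with similar co-occurrence profiles end up in the same block, which is what structural equivalence means in this context.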

Semantic network of clusters from CONCOR analysis related to adolescents' diets
CONCOR analysis was conducted on cluster words based on their structural equivalence relationship by analyzing Pearson's correlation in COM. Fig 2 shows Fig 3 shows the results of the CONCOR analysis of the adolescents' diet network constructed from blogs, called Blog Network, and the eight clusters that were identified. The cluster [calorie, worry, physical constitution, life, take dose herbal medicine, habit, menu, make, konjac, product] could be seen as words related to "eating habits." The cluster [treatment, appetite suppressant, use, needed, appearance, prescription] referred to "methods, except food and exercise." The cluster [meal, consult, exercise, increase] could be interpreted as "it is desirable to increase consultations on exercise and meals." The cluster [function, diverse, problem,

Discussion
This study was conducted to provide basic data for establishing a strategy for preventing adolescent obesity, which is increasing yearly, and for establishing desirable weight control strategies, by analyzing online data on the diet behaviors of adolescents using text-mining techniques. Among the words extracted by text mining on adolescents' diets, the five most frequent words were "obesity," "health," "exercise," "eat," and "increase" in online news, and "exercise," "eat," "weight loss," "obesity," and "health" in blogs. This result was consistent with that of the 2016 big data analysis of diet status that targeted Naver, the most used portal in Korea, in which "exercise" and "health" were the most frequent keywords [31]. In that study, the word "menu" was included in the top three, whereas in this study, "menu" was ranked relatively low, at 38th and 34th for online news and blogs, respectively. Although the previous study [31] had no age restrictions, and this study was limited to adolescents, both studies showed high frequencies of "exercise" and "health." These results suggest that diet is perceived as beneficial for health regardless of age, and that exercise is regarded as the most important factor in relation to diet.
What stood out in the centrality analysis of online news was that "appearance" ranked notably higher in betweenness centrality than in the other centrality measures. Thus, "appearance" can be considered to act as a bridge connecting other keywords. For example, the centrality analysis of keywords extracted from online news confirmed the significance of "appearance" in adolescents' diet behaviors, such as considering themselves obese after seeing an entertainer's appearance, or choosing diet products after seeing advertisements. Adolescence is a period of rapid physical growth and social development, when interest in one's appearance increases. Adolescents' values and attitudes toward their appearance are easily influenced by the mass media or their peer groups. Additionally, the high betweenness centrality of "appearance" is supported by studies stating that even non-obese adolescents are highly preoccupied with their appearance, for example by erroneously perceiving their body type as obese [24,32]. Disordered weight control behaviors should be considered when developing education programs to establish desirable weight control, given their prevalence among Korean adolescents [33], and their association with stress and depressive symptoms [24,34].
In this study, the issues identified from the CONCOR cluster analysis of online news and blogs were somewhat different. Based on the results of the CONCOR cluster analysis of keywords extracted from online news, the following can be inferred regarding intervention in adolescents' diet behaviors. First, during diet interventions, emphasizing education on side effects and how to prevent them is necessary. Second, entertainers and advertisements can affect adolescents' diets, so this point should be reflected in diet-related education. Third, referring to online news rather than blogs is better because online news has more content on diet-related education and information. Obesity treatment drugs have problems of side effects and abuse, and since a large-scale clinical study has not yet been conducted for adolescents, more attention is required. Meanwhile, the spread of a distorted sense of beauty in favor of an overly skinny body encourages the indiscriminate use of anti-obesity drugs; hence, safety issues are constantly being raised related to the overuse, dependence, and misuse of psychotropic appetite suppressants [35,36]. This supports this study's results, which showed a significant interest in side effects and their prevention following the use of therapeutics for weight loss. Diet inspiration-related information or slender models seen in the media affect individuals' perceptions of their body image, which also affects their self-attitudes, such as body dissatisfaction [37,38]. This is consistent with this study's results, in which appearance had the second highest value in the betweenness centrality analysis of online news, and the results of the CONCOR analysis showed that a cluster consisting of entertainers and advertisements could influence adolescents' diets. Another result confirmed in the CONCOR analysis of online news was that many content items were related to diet-related education and information. This is supported by the statement that online newspapers lend themselves to be used as a "research medium" for more information on issues that one is already interested in [39].
Based on the results of the CONCOR analysis of keywords extracted from blogs, the following can be inferred regarding intervention in adolescents' diet behaviors. First, it is necessary to emphasize the importance of food intake and diet for weight control. Second, adolescents' strong interest in body shape was confirmed, so this point should be reflected in interventions. Third, since blogs contain a lot of information about weight loss, information and education should be developed with reference to them.
Comparing the results of the CONCOR analysis of online news and of blogs, online news contained more education and information, such as how to eat, non-obese food, and the side effects of diet (weight control), whereas blogs contained more content on intake, body shape, and weight loss. This suggests that blogs are based mainly on their authors' subjective thoughts and direct experiences, whereas online news is focused on delivering objective information and explanations, based on the values of fairness and responsibility [40].
Diet and food-related content on social media may influence people's diets and weight-loss behaviors. Visual cues, such as images or videos of food, increase the likelihood of eating and gaining weight [41]. Moreover, research has shown increased marketing potential for unhealthy foods and beverages through social media [42]. This is consistent with the results of the CONCOR analysis of keywords extracted from blogs, showing that education on the importance of intake and diet should be emphasized. Adolescents are interested in weight control and prefer a skinny body; and even if their weight falls within the standard range or below, they still want to lose weight and follow a diet [32]. This supports the results of the CONCOR analysis of the keywords extracted from blogs in this study. Adolescents' subjective perceptions of being underweight and overweight were positively associated with problematic Internet use. Considering this, careful attention needs to be paid to adolescents' inappropriate weight control behaviors [43].
Although adolescence is a period in which physical and physiological growth along with development must be sufficiently achieved, excessive expectations for a slim body are highly likely to cause physical and psychological problems, such as damaging health and lowering self-esteem. Therefore, to prevent problems caused by extreme and excessive weight loss, it is necessary to provide reasonable monitoring standards for the media mainly used by adolescents, such as TV and the Internet, as well as education to critically select information and properly accept it. Furthermore, correct and educational information should be provided, so that adolescents can have more positive self-perceptions and personal satisfaction about their physical appearance, and thereby establish a desirable self-identity.

Conclusions
Although information related to adolescents' diets is widely available on the Internet, we collected data from Naver News, Naver blogs, and Daum blogs to obtain better crawling results, given that Google search results on this topic are mainly news and blogs. This study was limited by its search terms. Data were collected using the search term "adolescents' diets," along with similar and related words. The collected data may depend on the range of the similar words that were selected. In this study, data were collected only in the Korean language from Korean portal sites. Although information on adolescents' diets is available from websites worldwide, the data were collected in a single language to guarantee consistency in keyword selection. Despite these limitations, this study's outcomes were significant. As it analyzed data extracted from online news and blogs, its results will serve as a basis for weight management intervention strategies that reflect the perspectives of adolescents, who have a high rate of weight loss attempts and spend a lot of time on smartphones. Its results can also be used as basic data for providing correct information to adolescents, establishing desirable weight control in the future, and helping them to grow into healthy adults.